ABSTRACT


This report is intended as an introduction to one possible approach to
the seismic classification problem. It develops a very general classification
model using automatic non-parametric learning based on limited data of
known classification. The model accepts discriminants extracted from the
seismogram and yields the probability that the input was due to an earthquake
or an explosion. Thus, the discriminants are assumed to be available as
inputs. Pattern recognition as used here is defined, the classification
procedure is outlined, the adaptive estimation of joint probability densities from
a finite number of multi-dimensional vectors of known classification (the
learning model) is discussed, a simplified flow diagram of the learning
model is presented, and the selection of necessary control parameters is
investigated.

SECTION I


INTRODUCTION 

A  large  number  of  seismic  events  are  easily  classified  on  the 
basis  of  single  discriminants  such  as  epicenter.  However,  all  of 
the  events  of  interest  cannot  be  categorized  in  this  manner  and 
there  remains  a  subset  of  events  for  which  a  higher  degree  of 
decision-making  sophistication  is  required.  It  is  these  remaining 
events  which  are  of  interest  in  the  following  study.  For  these 
events,  the  backbone  of  the  classification  model  must  be  based  on 
measurements  related  to  the  seismogram. 

It will be assumed throughout that all events of interest have
been a priori detected. Thus, the input waveform to the classification
model will be known to contain an event. The objective will
then be to separate the events into the dichotomy of earthquake or
nuclear explosion or into even finer categorizations.

The  development  of  the  classification  model  will  be  divided 
into  two  major  components.  These  are  (i)  the  selection  of  a  set  of 
discriminants which is capable of classifying the event and (ii)
the  development  of  a  mathematical  model  utilizing  these  discriminants 
for  accomplishing  the  classification.  This  report  will  emphasize 
the  latter. 

Accordingly, the recognition system to be considered here will
accept an appropriately chosen set of discriminants as its input and
yield as its output, in the simplest case, the probability that the
event was an earthquake or an explosion. It will have the ability
to utilize simultaneously discriminants taken from mixed domains
such as time, frequency, and frequency-wave number and will
accomplish classification using the concepts of automatic,
non-parametric pattern recognition based on limited input data. Thus,
the system will be concerned with methods of automatically
establishing decision criteria for classifying events as members of
one or another class, when the only information available about any
class is that which is contained in a given finite set of samples
(having unknown statistics) known to belong to the class.

The  recognition  system  will  be  developed  with  two  distinct 
modes  of  operation;  a  learning  mode  in  which  the  system  is  exposed 
to  a  sequence  of  events,  each  labeled  according  to  the  class  or 
category  to  which  it  belongs,  and  a  recognition  mode  in  which  new 
unlabeled  events  are  classified  as  members  of  one  or  another  of 
these  classes.  During  the  learning  mode,  the  system  develops 
class-criteria  from  the  labeled  events  submitted  to  it,  and  during 
the recognition mode it uses these criteria for classifying unlabeled
events.

An event will be represented by an N-dimensional vector or
point whose components are the values of the N measurable discriminants
or parameters describing the event. Events belonging to the
same  category  will  be  represented  by  points  scattered  throughout 
some  region  of  N-dimensional  space  in  accordance  with  an  unknown 
(non-parametrically  learned)  N-dimensional  probability  distribution 
function.  For  the  case  of  two  discriminants  and  two  classes,  the 
hypothetical  two-dimensional  probability  densities  generated  from  a 
limited set of samples are shown in Figure 1.

Figure 1. NON-PARAMETRICALLY LEARNED PROBABILITY
DENSITY APPROXIMATION BASED ON LIMITED INPUT
DATA FOR A FUNCTION OF TWO VARIABLES

SECTION II


PRELIMINARY  DISCUSSION 

Learning  and  recognition  problems  of  pattern  recognition  can  be 
formulated  in  mathematical  terms  as  problems  of  recognition  of 
membership  in  classes.  The  starting  point  of  this  method  is  to  repre¬ 
sent  an  input  (in  our  case  the  seismic  signal)  by  a  set  of  measurements 
variously  called  discriminants,  clues,  features,  receptors,  parameters, 
coordinate  dimensions,  properties  or  attributes.  Accordingly,  in 
this  report,  the  terms  clues  or  discriminants  will  be  used  inter¬ 
changeably  to  describe  measurements  made  on  the  time,  frequency, 
frequency-wave  number,  etc.,  representation  of  the  seismic  signal. 

Each  input  that  belongs  to  a  given  class  (explosion,  earthquake,  etc.) 
will  be  regarded  as  a  vector  in  a  vector  space  which  is  located  at  a 
point defined by the discriminants. The class will then be represented
by the collection of these points scattered in some manner in
the vector space (often referred to as an observation or measurement
space).

Members  of  different  classes  are  distributed,  in  general,  in 
different  manners  in  the  space.  Machine  learning  (i.e.  learning 
the  pattern)  is  regarded  as  the  problem  of  determining  the  best  shape 
and  location  (i.e.  best  partitioning)  of  regions  in  the  vector  space 
so  that  the  classes  are  distinguishable.  This  is  illustrated  in 
Figure  2.  Pattern  recognition  or  classification  is  the  act  of  naming 
the  region  in  which  measurements  made  on  an  unknown  seismic  input 
are  contained. 

The  three  major  parts  of  the  pattern  recognition  system  to  be 
used  here  and  their  relationship  to  each  other  are  illustrated  by  the 
block  diagram  of  Figure  3.  This  shows  the  observation  system  that 
represents  the  seismic  input  by  a  set  of  measurements  on  this  input 
or its transformations (discriminants). The choice of these
discriminants is an important problem which is presently being
studied. The diagram also shows the "learning machine" in which seismic
inputs of known classification are processed (for developing a good
partition of the vector space). Finally, it shows the classification or
recognition system which evaluates an unknown seismic input to decide
in which partition of the space it is contained.

Figure 2. PARTITIONING OF VECTOR SPACE INTO REGIONS

Figure 3. GENERAL PATTERN RECOGNITION SYSTEM

There are many ways of partitioning the vector space into
regions. However, statistical methods (in particular, statistical
decision theory) seem to be a leading contender for effecting good
partitions. The applicability of decision theory in the design of
pattern recognition systems is readily appreciated by considering its
basic characteristics. Once input seismic stimuli are expressed in
terms of a set of discriminants, we want to design a classification
system with the best performance, i.e., one that makes the fewest
mistakes. In addition, we recognize that the classification
system will have to render decisions on inputs that are not identical
to those from which classification was learned (although they will be
similar, in general). It is well known that if we wish to minimize
the risk, the probability of error, or the maximum error due to the
decision we make, then we should make our decision by comparing
likelihood ratios to fixed thresholds. That is to say, if we must
choose between two classes, explosions and earthquakes, as giving
rise to the seismic stimulus which we observe through a set of
measurements on the seismic waveform or its transformations
(discriminants), then the optimum decision is based on the comparison
of the ratio of conditional probability densities with an appropriately
chosen constant. In mathematical form, this expresses the notion
that if the set of discriminants is a more likely occurrence under
the assumption that the seismic stimulus belongs to the class of
explosions than to the class of earthquakes, then common sense (and
statistical techniques) advise us to decide that an explosion
probably gave rise to our observations. Thus, decision theory
provides us with a design procedure which reflects ultimate system
performance as the basis for system design, and it also agrees with
intuition.

There  is  a  fundamental  difference  between  the  answers  that  are 
derivable  from  standard  statistical  techniques  and  the  answers 
sought here. Usually, decision theory assumes knowledge of the
relative frequency of occurrence of every observable set of discriminants
from all classes of interest. Here, this state of knowledge
is missing and estimates of the required quantities will be automatically
made from a finite number of class samples. Thus, we
recognize the fact that sparse seismic data with unknown statistics
may be available and we design our system to account for this.

SECTION III


CLASSIFICATION  BY  LIKELIHOOD  FUNCTION  ESTIMATION 

Consider the problem of deciding which of M classes has given
rise to an observed event, $\vec{x} = (x_1, x_2, \ldots, x_N)$, and suppose that
the statistics of events and classes are known, i.e., the joint
probability density function of $\vec{x}$ and m is known, where m
denotes the class label (m = 1, 2, ..., M). The decision
theoretical optimum method for processing a measured event $\vec{x}$ to
render the classification is well known. Specifically, $\vec{x}$ should
be regarded as a member of the k-th class if the cost of deciding in
favor of the k-th class is less than that of deciding in favor of
any of the other classes. This is stated in Equation 1.

$$\sum_{m=1}^{M} P_m \, p(\vec{x} \mid m) \, \left[ C_k(m) - C_z(m) \right] \le 0 \quad \text{for all } z \ne k,\; z = 1, 2, \ldots, M \tag{1}$$

where

$C_z(m)$ = the cost associated with deciding that $\vec{x}$ belongs
to the z-th class when in fact $\vec{x}$ belongs to the
m-th class,

$P_m$ = the a priori probability that an event from class
m will occur, and

$p(\vec{x} \mid m)$ = the conditional probability density function of
$\vec{x}$, given that $\vec{x}$ belongs to the m-th class.


This method of decision-making minimizes the average risk associated
with the classifications. If, as is appropriate with many practical
classification problems, the cost is the same for all misclassifications,
then Equation 1 reduces to the following decision rule:
decide $\vec{x}$ is a member of the k-th class if

$$P_k \, p(\vec{x} \mid k) \ge P_z \, p(\vec{x} \mid z) \quad \text{for all } z \ne k,\; z = 1, 2, \ldots, M. \tag{2}$$

Further, if the a priori probabilities are the same for all classes
($P_m = 1/M$ for all m), then Equation 2 becomes the following:
decide $\vec{x}$ is a member of the k-th class if

$$L_{\vec{x}}(k) \ge L_{\vec{x}}(z) \quad \text{for all } z \ne k,\; z = 1, 2, \ldots, M, \tag{3}$$


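where $L_{\vec{x}}(m) = p(\vec{x} \mid m)$ is commonly called the likelihood function of
m given the event $\vec{x}$. When class a priori probabilities are the
same, the likelihood function is proportional to the a posteriori probability
of class occurrence, i.e., $p(m \mid \vec{x}) = L_{\vec{x}}(m) / \sum_{z=1}^{M} L_{\vec{x}}(z)$,
so that ranking the classes by either quantity yields the same decision.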
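The chain from Equation 1 to Equation 3 can be illustrated with a short Python sketch. The function names, the cost-matrix layout (`cost[z][m]` is the cost of deciding class z when the true class is m), and all numerical values are assumptions made for illustration only; they are not taken from the report.

```python
def min_risk_class(priors, likelihoods, cost):
    """Return the class index z with the smallest sum_m P_m p(x|m) C_z(m).

    Choosing this index is equivalent to satisfying Equation 1 for every
    competing class. cost[z][m] is a hypothetical layout: the cost of
    deciding z when the event really belongs to class m.
    """
    n = len(priors)
    risks = [
        sum(priors[m] * likelihoods[m] * cost[z][m] for m in range(n))
        for z in range(n)
    ]
    return min(range(n), key=risks.__getitem__)


def max_likelihood_class(likelihoods):
    """Equation 3: with equal costs and equal priors, pick the largest p(x|m)."""
    return max(range(len(likelihoods)), key=likelihoods.__getitem__)


# Illustrative numbers only: class 0 = earthquake, class 1 = explosion.
priors = [0.5, 0.5]
likelihoods = [0.02, 0.05]        # p(x|m) evaluated at the observed x
equal_cost = [[0, 1], [1, 0]]     # every misclassification costs the same
skewed_cost = [[0, 1], [10, 0]]   # calling an earthquake an explosion costs 10

print(min_risk_class(priors, likelihoods, equal_cost))   # same as Equation 3
print(max_likelihood_class(likelihoods))
print(min_risk_class(priors, likelihoods, skewed_cost))  # costs change the decision
```

With equal costs and equal priors the two functions agree, which is the content of Equations 2 and 3; with the skewed cost matrix the decision can change even though the likelihoods do not.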

In the event that H of the dimensions of the N-dimensional
input vector $\vec{x}$ are not available (this corresponds to the case
where H of the selected set of N discriminants cannot be extracted
from the seismogram) and the classifier has been designed to operate
as an N-parameter processor, it is not obvious what the optimum
classification decision based on the N-H observed measurements
consists of. However, a study has been carried out in the appendix
for determining the method of making optimum decisions in this case.

It is concluded from this study that the optimum decision based on
N-H observed measurements consists of comparing the ratio of a
posteriori probabilities of the actually observed N-H measurements
with a threshold and that no useful purpose is served by knowing to
what higher-dimensional process the N-H measurements belong. This
agrees with intuition.
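In symbols, one way to write this conclusion is given below. It is only a sketch: the subscripts "obs" and "mis" (observed and missing components) are notation introduced here, and the rule is stated for the equal-cost case of Equation 2 rather than for a general threshold. The density of the N-H observed components follows from the full density by integrating out the H missing ones, and the decision uses that marginal density alone.

$$p(\vec{x}_{\mathrm{obs}} \mid m) = \int p(\vec{x}_{\mathrm{obs}}, \vec{x}_{\mathrm{mis}} \mid m) \, d\vec{x}_{\mathrm{mis}}, \qquad \text{decide } k \text{ if } P_k \, p(\vec{x}_{\mathrm{obs}} \mid k) \ge P_z \, p(\vec{x}_{\mathrm{obs}} \mid z) \text{ for all } z \ne k.$$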

Thus, we see that if the statistics of the events in classes are
known, then an optimum (from the standpoint of minimizing risk) method
of establishing classification decision boundaries in observation
space is known, and the only hurdle which remains is implementation of
this procedure. Unfortunately, however, this result can only be used
as a guide to solving the seismic classification problem, because
the statistics of the seismic input are usually not known precisely.

In  particular,  in  seismic  classification,  all  of  the  information 
available  on  the  statistics  of  the  seismic  signal  is  contained  in  the 
values  of  a  finite  number,  N  *  of  labeled  samples  from  each  of  the 
M  classes  or  categories.  However,  one  can  still  proceed  in  this 
situation  by  generating  estimates,  using  the  available  data  samples, 
of  the  likelihood  functions  (or  equivalently,  the  probability  density 
functions)  of  the  different  classes  over  the  observation  space,  and 
rendering  classification  decisions  in  a  manner  dictated  by  decision 
theory  using  the  estimated  quantities  in  lieu  of  the  "true"  functions. 
This  is  the  basis  for  the  classification  method  to  be  discussed  in 
this  report. 

The process of estimating the probability densities from labeled
samples of known classification can be regarded as "learning", while
the evaluation of likelihood ratios according to optimum decision
theory is called "recognition".

Some parametric learning methods assume that the functional
form of the densities is known (except for a set of undetermined
parameters), while non-parametric learning methods deliberately
assume no knowledge of the form of the densities (although some
assumption of their "well-behaved" character is implicit). Because
of the complicated nature and uncertainty of the form of the conditional
joint probabilities involved in seismic signal-processing, expression
of the densities in analytical form does not seem to be a reasonable
classification solution. Instead, non-parametric methods seem to be
a more realistic approach.

Thus, the complexity of the seismic problem leads us to consider
adaptive (non-parametric) methods of estimating the unknown (multimodal)
densities. In particular, classification might be accomplished
by storing non-parametrically determined values of the densities to
be estimated at a sufficiently large number of points of the vector
space, determining the stored point nearest to the unknown input $\vec{x}$,
looking up the value of the density at the nearest point or, perhaps,
interpolating among stored values of the densities near $\vec{x}$, and
rendering a classification decision based on the value of the observed
density. This can be visualized in the one-dimensional case as shown
in Figure 4. Note that fewer points can be used to represent the
density in the region where the density does not vary much, and more
points can be used where the density varies rapidly. Here the
one-dimensional probability density $p(x)$ is approximated with a
staircase approximation $\hat{p}(x)$. Similarly, an N-dimensional density
involving the joint probability of occurrence of N different
numerical values can be approximated by the N-dimensional equivalent
of a staircase approximation. Such an approximation of a probability
density is a histogram in N dimensions.

Since  the  density  function  p(x)  is  approximated  by  a  constant 
in  each  interval,  it  is  obvious  that  only  boundaries  of  the  intervals 
and  the  values  of  the  approximation  must  be  stored.  A  simple  method 
of  evaluating  a  histogram  approximation  at  an  arbitrary  point  can 
thus be devised. The procedure hinges on the ability to determine
simply the identity of the cell or interval, v, in which the input
to be classified is contained and then retrieving $p_v$, the corresponding
stored value of this approximation.
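For the one-dimensional case, storing only interval boundaries and heights might look like the following Python sketch. The function name, the half-open interval convention, and the numbers are assumptions for illustration; the cells are deliberately narrow where the density changes quickly and wide where it is flat.

```python
from bisect import bisect_right


def staircase_density(x, edges, heights):
    """Evaluate a piecewise-constant density estimate at x.

    edges   : sorted boundaries e_0 < e_1 < ... < e_n
    heights : n stored values, heights[v] applying on [e_v, e_(v+1))
    Points outside [e_0, e_n) are given density 0.0 in this sketch.
    """
    if x < edges[0] or x >= edges[-1]:
        return 0.0
    v = bisect_right(edges, x) - 1   # identity of the cell containing x
    return heights[v]                # stored value p_v


# Illustrative histogram with unequal cell sizes (total area is 1).
edges = [0.0, 1.0, 1.25, 1.5, 1.75, 2.0, 4.0]
heights = [0.1, 0.4, 1.0, 1.2, 0.6, 0.05]

print(staircase_density(0.5, edges, heights))
print(staircase_density(1.3, edges, heights))
print(staircase_density(5.0, edges, heights))
```

The binary search finds the cell in logarithmic time, which is one simple way of realizing the "determine the cell, then retrieve the stored value" procedure in one dimension.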

By storing the location of the centers of the cells as a set of
points, $\{\vec{s}_v\}$, where $\vec{s}_v$ is the stored center of the v-th cell, the
interior of an arbitrary cell $\ell$ is readily defined as the locus of
points "nearer" to $\vec{s}_\ell$ than to any other stored point. The classification
procedure thus implied is:

1. Determine the stored point $\vec{s}_\ell$ that is "nearer" to the
input vector $\vec{x}$ than any other stored point $\vec{s}_v$ (v not equal
to $\ell$).

2. Use the stored probability density $\hat{p}(\vec{s}_\ell)$ (approximately
equal to $p(\vec{x})$) to estimate $p(\vec{x})$.

3. Repeat this procedure for all classes and compute the likelihood
ratios, joint probabilities, etc. necessary for classification.

Figure 4. HISTOGRAM ESTIMATION USING UNEQUAL CELL SIZES
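The three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: plain Euclidean distance stands in for "nearness" (the report goes on to introduce a cell-dependent measure), the final decision uses the equal-cost rule of Equation 2, and the class names, stored points, and densities are invented.

```python
import math


def nearest_density(x, centers, densities):
    """Steps 1 and 2: find the stored point nearest to x and return its density."""
    nearest = min(range(len(centers)), key=lambda v: math.dist(x, centers[v]))
    return densities[nearest]


def classify(x, models, priors):
    """Step 3: repeat for every class and pick the largest prior * density.

    models maps a class name to (stored centers, stored densities).
    """
    scores = {
        name: priors[name] * nearest_density(x, centers, densities)
        for name, (centers, densities) in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical stored "typical samples" for two classes in two dimensions.
models = {
    "earthquake": ([(0.0, 0.0), (1.0, 0.0)], [0.30, 0.10]),
    "explosion": ([(3.0, 3.0), (2.0, 2.0)], [0.40, 0.05]),
}
priors = {"earthquake": 0.5, "explosion": 0.5}

print(classify((0.2, 0.1), models, priors)[0])
print(classify((2.9, 3.2), models, priors)[0])
```

Only the stored points and their densities are consulted at recognition time; the learning samples themselves are not needed once the cells exist.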

One can place this construction of histograms with unequal cell
sizes on an exact mathematical basis by asking (and solving) questions
of the type "What is the optimum choice for the location, size and
height of the cells to minimize the expected error between $p(\vec{x})$
and its estimate, $\hat{p}(\vec{x})$?" However, since in practice $p(\vec{x})$ is
unknown and must be obtained from samples, this is the same as
attacking the problem of how to obtain a good histogram directly from
the samples. It is readily appreciated that cells representing the
distribution of a set of known samples of class k must be located
in those regions in the vector space where members of class k are
observed. Thus, it seems desirable to have members of the class
create and determine the locations and dimensions of the histogram
cells. Since the cell centers thus obtained typify the distribution
of class k, the stored points $\{\vec{s}_v\}$ will be called "typical samples"
of the class.

Since the interior of an arbitrary cell $\ell$ in the histogram is
defined as the locus of points "nearer" to $\vec{s}_\ell$ than to any other
stored point (v not equal to $\ell$), one should postulate distance
measures which stretch when they measure "nearness" to a stored point
$\vec{s}_v$ whose cell is wide, and shrink for a narrow cell. A squared
distance measure exhibiting this property is expressed by the quadratic
form $Q_v(\vec{x})$ given by

$$Q_v(\vec{x}) = \sum_{i=1}^{N} \left( \frac{x_i - s_{vi}}{\sigma_{vi}} \right)^2 \tag{4}$$

where v identifies the cell and i is the specific dimension of the
space under consideration. This quadratic form expresses the notion
that the approximated density varies differently in one cell than in
another, and it also expresses coordinate-direction-dependent
differences in the rate of variation of the function. It is an
expression of the location and also the shape of the cells of the
N-dimensional histogram. Thus, a difference between parameter values
of the input $\vec{x}$ and the stored sample $\vec{s}_v$ may be judged more
significant in one neighborhood of the vector space than in another.
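A small Python sketch shows how such a scaled measure changes which stored point counts as "nearest". It assumes the quadratic form is the sum over coordinates of squared differences, each divided by a per-cell, per-coordinate spread; the function names and numbers are illustrative only.

```python
def quadratic_form(x, center, sigma):
    """Squared distance from x to a cell center, scaled per coordinate by sigma."""
    return sum(((xi - si) / sg) ** 2 for xi, si, sg in zip(x, center, sigma))


def nearest_cell(x, centers, sigmas):
    """Return (index, Q value) of the cell with the smallest quadratic form."""
    q = [quadratic_form(x, c, s) for c, s in zip(centers, sigmas)]
    v = min(range(len(q)), key=q.__getitem__)
    return v, q[v]


# Cell 0 is wide, cell 1 is narrow (illustrative values).
centers = [(0.0, 0.0), (4.0, 0.0)]
sigmas = [(2.0, 2.0), (0.5, 0.5)]
x = (2.5, 0.0)

# In plain Euclidean terms x is closer to cell 1 (1.5 versus 2.5),
# but the scaled measure assigns it to the wide cell 0.
print(nearest_cell(x, centers, sigmas))
```

The same coordinate difference is therefore judged small near a wide cell and large near a narrow one, which is the behavior the text asks of the distance measure.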

In  the  event  that  a  new  input  vector  does  not  fall  within  any 
cell,  it  will  be  assumed  that  the  probability  density  is  well  behaved 
and  exhibits  Gaussian  decay  in  regions  where  the  probability  density 
is small.

SECTION IV


THE ADAPTIVE APPROXIMATION OF PROBABILITY DENSITIES FROM LIMITED DATA

In the method of evaluation of probability densities described
above, the approximated density was described by a set of typical
samples and cell shapes determined by quadratic forms specified by
means $\{\vec{s}_v\}$ and variances $\{\sigma_{vi}^2\}$. In the following, an
algorithm is described for generating the cells from data in an
adaptive manner by accepting input samples of known classification
sequentially. A simplified flow chart illustrating the procedure is
shown in Figure 5.

When the first learning sample is introduced, a cell of prechosen
size and shape is created and is centered on the first learning
sample. The initial size and shape of the cell are determined by
prior analysis of the data (to be discussed in the next section) as
part of the initializing procedure. The interior of the cell is
defined by Equation 5, the equation of an ellipsoid in N dimensions,
where the squared radii of the ellipsoid are expressed by $\sigma_{vi}^2(t)$,
and $T_N^2$ is a threshold control parameter. In Equation 5, the symbol
t signifies the fact that the cell center and shape are functions of
the number of learning samples contained in the v-th cell up to the
present time. T will denote the total number of inputs to the
present.

$$Q_v(\vec{x}, t) = \sum_{i=1}^{N} \left[ \frac{x_i - s_{vi}(t)}{\sigma_{vi}(t)} \right]^2 \le T_N^2 \tag{5}$$


The first input vector becomes the first typical sample. This
plus an estimate of the density, given by Equation 6, is stored. The
density is estimated by the ratio of the fraction of the total number
of input vectors that fall in a cell to the volume of that cell.
Except for a constant (given in the next section) that depends
on the number of dimensions, the volume of the cell is expressed by
the product of the standard deviations in the quadratic form used to
define the boundaries of the cell.

$$\hat{p}(\vec{s}_v, t) = \frac{t}{T \prod_{i=1}^{N} \sigma_{vi}(t)} \tag{6}$$

Figure 5. SIMPLIFIED FLOW DIAGRAM FOR A NON-PARAMETRIC LEARNING PROGRAM
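As a numerical illustration of the density estimate, the following Python sketch divides the fraction of inputs in a cell by the product of the cell's standard deviations. The function name and the `volume_constant` argument (a stand-in for the dimension-dependent constant mentioned in the text, left at 1 here) are assumptions.

```python
import math


def cell_density(t, total, sigma, volume_constant=1.0):
    """Estimate the density in a cell.

    t               : number of input vectors that fell in the cell
    total           : total number of input vectors so far (T)
    sigma           : per-coordinate standard deviations of the cell
    volume_constant : hypothetical stand-in for the dimension-dependent
                      constant; 1.0 gives the estimate up to that constant
    """
    if total <= 0:
        raise ValueError("total must be positive")
    volume = volume_constant * math.prod(sigma)
    return (t / total) / volume


# 5 of 20 inputs fell in a two-dimensional cell with spreads 0.5 and 2.0.
print(cell_density(5, 20, (0.5, 2.0)))
```

A cell that captures the same fraction of inputs in a smaller volume receives a proportionally larger density, as a histogram height should.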


The  second  learning  vector  is  used  to  generate  a  second  cell, 
similar  to  the  first,  if  it  falls  sufficiently  outside  the  first  cell. 
However,  if  the  second  vector  falls  inside  the  first  cell,  the  center 
of  that  cell  is  shifted  to  the  mean  of  the  two  learning  vectors,  the 
shape and size of the cell are adapted from a better knowledge of the
local distribution of members of the class, and the local estimate of
the probability density is updated accordingly. If the second vector
falls outside the first cell, but not by a large amount, it is stored
temporarily  to  be  reused  at  a  later  time  according  to  a  procedure  to 
be  described  in  subsequent  paragraphs. 

The  third  and  subsequent  learning  vectors  are  processed  similarly, 
either  generating  new  cells,  updating  old  cells,  or  being  stored 
temporarily  for  later  use.  The  cells  so  generated  for  each  class  are 
located  only  in  the  portion  of  the  vector  space  where  members  of  the 
individual  classes  have  been  observed. 

It  is  seen  through  the  above  discussion  that  as  learning  vectors 
are  introduced  sequentially,  the  cell  in  the  immediate  neighborhood 
of  the  input  vector  changes  shape,  size,  location  and  height.  It  is 
therefore  important  to  examine  the  time-dependency  of  these  cell 
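parameters. Accordingly, the variances that determine the cell shape
are given by Equations 7 and 8.

$$\sigma_{vi}^2(t) = \max\left[ \sigma_{vi}^2(0),\; \hat{\sigma}_{vi}^2(t) \right], \tag{7}$$

$$\hat{\sigma}_{vi}^2(t) = \frac{1}{t} \sum_{r=1}^{t} \left[ x_i(r) - s_{vi}(t) \right]^2, \tag{8}$$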


where

t denotes the number of input vectors that fell in the
v-th cell up to the present time,
$x_i(r)$ is the i-th coordinate value of the r-th input vector
falling in the v-th cell,
$s_{vi}(t)$ is the i-th coordinate value of the v-th cell center
after t contributions to the cell.

Equation 7 expresses the manner in which the i-th coordinate of
the v-th radius $T_N \sigma_{vi}(t)$ grows if the sample variance $\hat{\sigma}_{vi}^2(t)$ of
the t vectors in the cell exceeds the initial variance $\sigma_{vi}^2(0)$.
The cell radius is never allowed to shrink to less than the initial
value $T_N \sigma_{vi}(0)$. The reason for defining the cell in this way is
to encourage the cells to increase in size as more inputs are received,
thus keeping the total number of cells used in the approximation of
the probability density small.
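The update of a single cell can be sketched in Python as below. Two assumptions are made for the sketch: the sample variance is normalized by t (the number of vectors in the cell), and the cell is recomputed from the full list of its members rather than by a running update. The function name is illustrative.

```python
import math


def update_cell(members, sigma0):
    """Recompute a cell's center and spreads from its member vectors.

    members : vectors that have fallen in the cell so far
    sigma0  : initial per-coordinate standard deviations sigma_vi(0)
    Returns (center, sigma); sigma never falls below sigma0.
    """
    t = len(members)
    n = len(sigma0)
    center = [sum(m[i] for m in members) / t for i in range(n)]
    sigma = []
    for i in range(n):
        sample_var = sum((m[i] - center[i]) ** 2 for m in members) / t
        sigma.append(max(sigma0[i], math.sqrt(sample_var)))
    return center, sigma


# Three members spread along the first coordinate only.
center, sigma = update_cell([(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)], (1.0, 1.0))
print(center)  # the cell center moves to the sample mean
print(sigma)   # first spread grows past 1.0, second stays at its initial value
```

The cell therefore widens only in the directions in which the data are actually scattered and keeps its initial extent elsewhere.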

To insure that each cell can grow while reducing the chance for
an overlapping coverage of the same region of the vector space by
several cells, an outer control parameter $\theta$ ($\theta \ge 1$) is introduced.
Thus, a vector $\vec{x}$ not falling within an existing cell (as defined by
the threshold $T_N$) is used to generate a new cell only if it is outside
the larger concentric cell defined by Equation 9.

$$Q_v(\vec{x}, t) \le (\theta T_N)^2 \tag{9}$$

It is seen that the quantity $\theta$ expresses the ratio of the outer to
inner diameter of a "guard zone" within which input vectors neither
create new cells nor update old ones.
The input vectors which neither create nor update cells are stored
temporarily for later use. As the cells grow in size, these stored
vectors can be forced into the existing cell structure without the
need to create new cells.
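The learning procedure for one class (create a cell, update a cell, or hold the vector in the guard zone) can be sketched in Python as follows. This is a minimal illustration, not the program of Figure 5. The assumptions are: the quadratic form is the scaled sum of squares, cells are refit from all of their members with a variance normalized by t, held vectors are re-examined once after the first pass, and all names and numbers are invented.

```python
import math


def q_form(x, cell):
    """Scaled squared distance from x to the center of a cell."""
    return sum(((xi - c) / s) ** 2
               for xi, c, s in zip(x, cell["center"], cell["sigma"]))


def refit(cell, sigma0):
    """Move the center to the member mean; let each spread grow, never shrink."""
    members = cell["members"]
    t = len(members)
    for i in range(len(sigma0)):
        mean = sum(m[i] for m in members) / t
        var = sum((m[i] - mean) ** 2 for m in members) / t
        cell["center"][i] = mean
        cell["sigma"][i] = max(sigma0[i], math.sqrt(var))


def present(x, cells, sigma0, t_n, theta):
    """Process one learning vector; return 'updated', 'held', or 'created'."""
    if cells:
        nearest = min(cells, key=lambda c: q_form(x, c))
        q = q_form(x, nearest)
        if q <= t_n ** 2:                 # inside the cell
            nearest["members"].append(x)
            refit(nearest, sigma0)
            return "updated"
        if q <= (theta * t_n) ** 2:       # inside the guard zone
            return "held"
    cells.append({"members": [x], "center": list(x), "sigma": list(sigma0)})
    return "created"


def learn(samples, sigma0, t_n, theta):
    """Build the cells of one class; return (cells, vectors still held)."""
    cells, held = [], []
    for x in samples:
        if present(x, cells, sigma0, t_n, theta) == "held":
            held.append(x)
    still_held = []
    for x in held:                        # cells may have grown meanwhile
        nearest = min(cells, key=lambda c: q_form(x, c))
        if q_form(x, nearest) <= t_n ** 2:
            nearest["members"].append(x)
            refit(nearest, sigma0)
        else:
            still_held.append(x)
    return cells, still_held


samples = [(0.0,), (1.0,), (3.0,), (10.0,), (2.0,), (2.0,)]
cells, leftover = learn(samples, sigma0=(1.0,), t_n=2.0, theta=2.0)
print([len(c["members"]) for c in cells], leftover)
```

In this one-dimensional run the vector at 3.0 first lands in the guard zone of the cell near the origin and is held; after that cell's center has moved toward it, the second pass absorbs it without creating a new cell.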

After a cell structure is obtained by the procedure described
above, we may find that the number of cells created is larger than
the number we would like to have in the N-dimensional generalized
histogram. We may force the reduction of the number of cells created
by altering the cell growth controlling parameters $T_N$ and $\theta$. In
most cases, however, it can be expected that a significant percentage
of the cells created will contain very few input vectors, and, in
general, these sparsely populated cells will surround the more
populous cells. This will happen because each cell center (typical
sample), after the cell's initial creation, will migrate in the vector
space and tend toward the nearest mode (local peak) of probability
density to be approximated. This is readily seen from the
one-dimensional illustration shown in Figure 6.

This figure shows a small range of the variable $x_i$ and the
probability density $p(x_i)$ in that interval. The point $s_{vi}(t)$
represents the i-th coordinate of the center of the v-th cell after
t members fell into the cell. The probability is greater that the
next input is to the right of $s_{vi}(t)$ than that it is to the left of
that point. This implies that the cell center will move to the right
after the t-plus-first input falling within the v-th cell is introduced.
It is thus seen that cells migrate in the direction of the
nearest modes. As cells move toward modes, and later inputs create
cells at places from which older cells have migrated, there will
always be some cells which contain few members. Thus the number of
cells can be reduced by forcing cell locations containing few
members into the nearest cells whose members exceed a predetermined
number.
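One possible form of this reduction step is sketched below in Python. The report does not spell out the details, so several choices here are assumptions: "nearest" is measured by Euclidean distance between cell centers, a cell is populous when it has more members than `threshold`, only the centers (not the spreads) are recomputed after merging, and the function name is invented.

```python
import math


def merge_sparse_cells(cells, threshold):
    """Force sparse cells into the nearest populous cell.

    cells     : list of {"center": [...], "members": [...]} dictionaries
    threshold : a cell is populous if it has more than this many members
    Returns a new list containing only the (enlarged) populous cells.
    """
    populous = [
        {"center": list(c["center"]), "members": list(c["members"])}
        for c in cells if len(c["members"]) > threshold
    ]
    if not populous:
        return cells                      # nothing to merge into
    for cell in cells:
        if len(cell["members"]) > threshold:
            continue
        target = min(populous,
                     key=lambda p: math.dist(cell["center"], p["center"]))
        target["members"].extend(cell["members"])
    for p in populous:                    # recenter on the enlarged membership
        t = len(p["members"])
        p["center"] = [sum(m[i] for m in p["members"]) / t
                       for i in range(len(p["center"]))]
    return populous


cells = [
    {"center": [0.0], "members": [(-1.0,), (0.0,), (1.0,)]},
    {"center": [10.0], "members": [(9.0,), (10.0,), (11.0,)]},
    {"center": [3.0], "members": [(3.0,)]},   # sparse cell left behind
]
merged = merge_sparse_cells(cells, threshold=2)
print([(c["center"], len(c["members"])) for c in merged])
```

The single-member cell at 3.0 is absorbed by the populous cell near the origin, which reduces the cell count from three to two while keeping every learning vector.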
Figure 6. MODE SEEKING PROPERTY OF CELLS (figure labels: probability
that the t-plus-first vector will be to the right of $s_{vi}(t)$; probability
that the t-plus-first vector will be to the left of $s_{vi}(t)$)



SECTION  V 


CONTROL  PARAMETER  SELECTION  THEORY 


In  the  following  paragraphs,  some  of  the  properties  of  the  cell 
growth mechanism will be discussed. It is desirable that the individual
cells be adjusted by the data so that a good approximation to
the class probability density function is obtained with a
minimum number of cells. Furthermore, the size and shape of the
individual  cells  should  be  determined  by  a  reasonable  and  automatic 
procedure  from  the  data  in  order  to  relieve  the  experimenter  from 
the  almost  impossible  task  of  picking  appropriate  cell  sizes. 

For simplicity, consider an isolated cell v and let $\vec{x}(t)$
be the t-th observation point (known class member) that falls in the
cell, let $\vec{s}_v(t)$ be the sample mean of the first t observations
that lie in the cell (i.e. $\vec{s}_v(t)$ is the center of the cell at the t-th
step), let $\vec{\sigma}_v(t)$ be a vector weighting parameter determined
according to Equations 7 and 8, indicating the cell shape, and
finally, let $T_N$ be a scalar constant (the constant $T_N$ is the
control parameter being studied here). Then, the cell is defined at
the t-th step to be the set of points in the observation space defined
by Equation 5 (repeated below for convenience).

$$Q_v(\vec{x}, t) = \sum_{i=1}^{N} \left[ \frac{x_i - s_{vi}(t)}{\sigma_{vi}(t)} \right]^2 \le T_N^2 \tag{5}$$


Thus, the cell is the (ellipsoidal) locus of points "closer" to the
cell mean $s_{vi}(t)$ than $T_N \sigma_{vi}(t)$ in the i-th direction. It should
be emphasized again that such a cell is "mode seeking" in that it will
move (as a function of t) in the direction of the greatest concentration
of data points. This is a very desirable feature. The cell
is first established according to some rule by a data point which
does not fall in any other cell so that $\vec{s}_v(1) = \vec{x}(1)$, i.e., the
cell is initially centered about the first or establishing data
point. If $\vec{\sigma}_v(t) = \vec{\sigma}_v(0)$ for all t, the cell size and shape remain
the same throughout the estimation process. Then the choice of
$\vec{\sigma}_v(0)$, which is based largely on physical considerations and intuition,
is very critical and an intelligent choice is very difficult.

But, if $\vec{\sigma}_v(t)$ is made to depend on the data sample, the volume of
the cell may be made to grow to an "optimum" size by proper choice
of the constant $T_N$. Although the cell might alternatively be made
to shrink if the data indicated this were desirable, it is assumed
here that the initial cell size is small compared to intervals in
which the class probability density function changes greatly and,
hence, only cell expansion is discussed below.

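The rule for updating the vector
$\sigma_\nu(t) = [\sigma_{\nu 1}(t), \sigma_{\nu 2}(t), \ldots, \sigma_{\nu N}(t)]$
is found from Equations 7 and 8 as

$$\sigma_{\nu i}(t) = \max\left[ \sigma_{\nu i}(0),\; \hat{\sigma}_{\nu i}(t) \right] , \tag{10}$$

where $\hat{\sigma}_{\nu i}(t)$ is the sample standard deviation, about the cell
mean $S_{\nu i}(t)$, of the i-th coordinate of the observations in the cell.
Thus, $\sigma_{\nu i}(t)$ begins at a preset value and normally grows to be the
sample standard deviation of the sample vectors in the cell
neighborhood.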

The radius of the cell defined by Equation 5, in the i-th
coordinate direction, is $\sigma_{\nu i}(t)\tau_N$. The constant $\tau_N$ is chosen
according to the theory to be developed here; however, the initial cell
radii $\sigma_{\nu i}(0)\tau_N$ must be selected on the basis of physical
considerations.
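The cell bookkeeping described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the class
name `Cell` and its methods are hypothetical, the membership test uses the
ellipsoidal form of Equation 5 as read here, and the sample standard
deviation is normalized by 1/t, which is an assumption since Equations 7
and 8 are not repeated in this section.

```python
import math


class Cell:
    """One adaptive cell: running mean S, spread sigma, fixed control parameter tau."""

    def __init__(self, first_point, sigma0, tau):
        self.n = 1                          # number of observations absorbed (t)
        self.mean = list(first_point)       # S(1) is the establishing point
        self.m2 = [0.0] * len(first_point)  # running sums of squared deviations
        self.sigma0 = list(sigma0)          # preset sigma(0)
        self.sigma = list(sigma0)           # current sigma(t)
        self.tau = tau

    def contains(self, x):
        # Equation 5 (as read here): sum_i ((x_i - S_i) / sigma_i)^2 <= tau^2
        d2 = sum(((xi - si) / sg) ** 2
                 for xi, si, sg in zip(x, self.mean, self.sigma))
        return d2 <= self.tau ** 2

    def update(self, x):
        # Absorb one observation: move the center, then apply the max rule (Eq. 10).
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n                 # running sample mean
            self.m2[i] += delta * (xi - self.mean[i])      # Welford accumulation
            sample_std = math.sqrt(self.m2[i] / self.n)    # 1/t normalization (assumed)
            self.sigma[i] = max(self.sigma0[i], sample_std)


cell = Cell([0.0, 0.0], sigma0=[0.5, 0.5], tau=2.5)
for point in ([1.0, 0.0], [2.0, 0.0]):
    if cell.contains(point):
        cell.update(point)
print(cell.mean, cell.sigma)
```

The spread in a coordinate stays at its preset value until the sample
standard deviation exceeds it, which is the "expansion only" behavior
assumed in the text.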

The cell volume might be considered optimum if it is as large
as possible and still yields an estimated probability density function
consistent with that obtained by estimating over smaller cells. If
a cell is located in a region of the observation space over which the
class probability density function is a constant, the cell size
should expand until it covers the region of uniform distribution.
Furthermore, once the cell is "firmly established" in the sense
that a number of observations have fallen in the cell, the rate of
expansion should be fairly rapid, provided the cell does not grow
substantially beyond the region of constant probability density function.

On the other hand, if the cell is initially located in a region over
which the class probability density function is changing, the cell
should not expand rapidly but should migrate toward a mode.
Therefore, the rule for updating $\sigma_\nu(t)$ and the choice of $\tau_N$ should be
such that the expected cell behavior obeys these two intuitive rules.

Using the above notions, we are now in a position to construct
a model of the cell growth mechanism through a study of the random
behavior of the cell. Accordingly, the volume of an N-dimensional
ellipsoid is (the t and $\nu$ designators will be temporarily omitted
for convenience)

$$V_N = k_N \prod_{i=1}^{N} \sigma_i , \tag{11}$$

where

$$\sum_{i=1}^{N} \left( \frac{x_i}{\sigma_i} \right)^2 \le 1 \tag{12}$$

specifies the N-dimensional ellipsoid and

$$k_N = \frac{\pi^{N/2}}{\Gamma\left(\frac{N}{2} + 1\right)} . \tag{13}$$

A short table of $k_N$ is given below.

N      1    2    3       4       5         6       7          8        9
k_N    2    π    4π/3    π²/2    8π²/15    π³/6    16π³/105   π⁴/24    32π⁴/945
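The tabulated coefficients follow directly from Equation 13 and can be
checked numerically. The short sketch below does so with the standard
library gamma function; the function names are illustrative only.

```python
import math


def unit_ball_coefficient(n):
    """k_N = pi^(N/2) / Gamma(N/2 + 1), Equation 13."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def ellipsoid_volume(semi_axes):
    """V_N = k_N times the product of the semi-axes, Equation 11."""
    return unit_ball_coefficient(len(semi_axes)) * math.prod(semi_axes)


# Print k_N for N = 1, ..., 9 for comparison with the table.
for n in range(1, 10):
    print(n, unit_ball_coefficient(n))
```

For example, `ellipsoid_volume([1.0, 2.0, 3.0])` evaluates the familiar
three-dimensional result, 4π/3 times the product of the semi-axes.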

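A slice perpendicular to the $x_j$-axis at $x_j$ is an (N-1)-dimensional
ellipsoid specified by

$$\sum_{i \ne j} \left( \frac{x_i}{\sigma_i} \right)^2 \le 1 - \left( \frac{x_j}{\sigma_j} \right)^2 \tag{14}$$

of volume

$$V_{N-1}(x_j) = k_{N-1} \left[ 1 - \left( \frac{x_j}{\sigma_j} \right)^2 \right]^{\frac{N-1}{2}} \prod_{i \ne j} \sigma_i . \tag{15}$$

Assuming a uniform probability distribution over the N-dimensional
ellipsoid specified by Equation 12, the probability density function
of the $x_j$ coordinate is

$$g_N(x_j) = \frac{V_{N-1}(x_j)}{V_N} = \frac{k_{N-1} \left[ 1 - \left( \frac{x_j}{\sigma_j} \right)^2 \right]^{\frac{N-1}{2}} \prod_{i \ne j} \sigma_i}{k_N \prod_{i=1}^{N} \sigma_i} . \tag{16}$$

However, since the volume $V_N$ can be obtained by integrating Equation
15 over $x_j$, it is found that

$$V_N = \int_{-\sigma_j}^{\sigma_j} k_{N-1} \left[ 1 - \left( \frac{x_j}{\sigma_j} \right)^2 \right]^{\frac{N-1}{2}} \prod_{i \ne j} \sigma_i \, dx_j$$

or

$$V_N = k_{N-1} \, B\left( \tfrac{1}{2}, \tfrac{N+1}{2} \right) \prod_{i=1}^{N} \sigma_i , \tag{17}$$

where

$$B\left( \tfrac{1}{2}, \tfrac{N+1}{2} \right) = \frac{\Gamma\left( \frac{1}{2} \right) \Gamma\left( \frac{N+1}{2} \right)}{\Gamma\left( \frac{N}{2} + 1 \right)} \tag{18}$$

is the beta function. Accordingly,

$$g_N(x_j) = \frac{\Gamma\left( \frac{N}{2} + 1 \right)}{\sqrt{\pi} \, \Gamma\left( \frac{N+1}{2} \right) \sigma_j} \left[ 1 - \left( \frac{x_j}{\sigma_j} \right)^2 \right]^{\frac{N-1}{2}} , \quad \text{if } -\sigma_j \le x_j \le \sigma_j , \tag{19}$$

$$g_N(x_j) = 0 , \quad \text{if } |x_j| > \sigma_j .$$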

Making the transformation of variables $x_j = X_j \sigma_j$, the probability
density function of $X_j$ is

$$h_N(X_j) = \frac{\Gamma\left( \frac{N}{2} + 1 \right)}{\sqrt{\pi} \, \Gamma\left( \frac{N+1}{2} \right)} \left( 1 - X_j^2 \right)^{\frac{N-1}{2}} , \quad \text{for } |X_j| \le 1 , \tag{20}$$

$$h_N(X_j) = 0 , \quad \text{for } |X_j| > 1 .$$

The maximum value of $g_N(x_j)$ is $k_{N-1} / (k_N \sigma_j)$ and that of $h_N(X_j)$ is
$k_{N-1} / k_N$. Examples of $h_N(X_j)$ for several values of N are shown
in Figure 7. It will be shown later that for small $X_j$, $h_N(X_j)$
is approximately gaussian.
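As a numerical sanity check on Equation 20 in the form given above, the
following sketch evaluates the density and integrates it over [-1, 1] with
a simple midpoint rule. The helper names and the number of integration
steps are arbitrary choices made for illustration.

```python
import math


def h(n, x):
    """Density of one normalized coordinate, Equation 20 as reconstructed here."""
    if abs(x) > 1.0:
        return 0.0
    c = math.gamma(n / 2 + 1) / (math.sqrt(math.pi) * math.gamma((n + 1) / 2))
    return c * (1.0 - x * x) ** ((n - 1) / 2)


def integrate(f, a, b, steps=20000):
    """Composite midpoint rule on [a, b]."""
    w = (b - a) / steps
    return w * sum(f(a + (k + 0.5) * w) for k in range(steps))


# The area should be close to one; the peak value should equal k_{N-1} / k_N.
for n in (1, 2, 3, 6):
    area = integrate(lambda x: h(n, x), -1.0, 1.0)
    print(n, area, h(n, 0.0))
```

For N = 1 the density is uniform on [-1, 1], and for N = 2 it is the
semicircular density, which gives two easy spot checks.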

The mean and variance of $X_j$ are easily found to be

$$E[X_j] = 0 \tag{21}$$

and

$$\mathrm{Var}[X_j] = \frac{1}{N+2} . \tag{22}$$


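To show that $h_N(X_j)$ is approximately gaussian for small $X_j$,
rewrite Equation 20 as

$$h_N(X_j) = \left[ \frac{\Gamma\left( \frac{N}{2} + 1 \right)}{\sqrt{\frac{N-1}{2}} \; \Gamma\left( \frac{N+1}{2} \right)} \right] \left[ \sqrt{\frac{N-1}{2\pi}} \right] \left[ \left( 1 - X_j^2 \right)^{\frac{N-1}{2}} \right] . \tag{23}$$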


The first factor tends to unity as $N \to \infty$ by Stirling's formula.
The logarithm of the third factor is

$$\frac{N-1}{2} \ln\left( 1 - X_j^2 \right) = -\frac{N-1}{2} \left[ X_j^2 + \frac{X_j^4}{2} + \frac{X_j^6}{3} + \cdots \right] . \tag{24}$$

Hence, for any fixed $X_j$ such that
$1 \gg \frac{X_j^2}{2} + \frac{X_j^4}{3} + \cdots$, the logarithm of the
third factor is approximately

$$-\frac{X_j^2 (N-1)}{2} .$$

Accordingly, in this region, $h_N(X_j)$ is approximately normal, with
probability density function given by

$$h_N(X_j) \approx \sqrt{\frac{N-1}{2\pi}} \exp\left[ -\frac{X_j^2 (N-1)}{2} \right] . \tag{25}$$

The mean of this distribution is zero and the variance is $1/(N-1)$.
The tails of the distribution of $X_j$ decrease much faster than for
the gaussian distribution (i.e., where $|X_j|$ is large), and $h_N(X_j)$
goes to zero at $\pm 1$.
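The variance of Equation 22 and the quality of the gaussian approximation
can both be checked numerically. The sketch below is an illustration only:
it assumes the forms of the density and of the approximation given above,
and the helper names and the sample values of N are arbitrary.

```python
import math


def h(n, x):
    """Exact density of the normalized coordinate (Equation 20, as read here)."""
    if abs(x) > 1.0:
        return 0.0
    c = math.gamma(n / 2 + 1) / (math.sqrt(math.pi) * math.gamma((n + 1) / 2))
    return c * (1.0 - x * x) ** ((n - 1) / 2)


def gaussian_approx(n, x):
    """Normal density with zero mean and variance 1/(N-1) (Equation 25, as read here)."""
    return math.sqrt((n - 1) / (2.0 * math.pi)) * math.exp(-x * x * (n - 1) / 2.0)


def variance(n, steps=20000):
    """Second moment of h by the midpoint rule; the mean is zero by symmetry."""
    w = 2.0 / steps
    xs = (-1.0 + (k + 0.5) * w for k in range(steps))
    return w * sum(x * x * h(n, x) for x in xs)


# Exact variance 1/(N+2) versus the variance 1/(N-1) of the approximation.
for n in (5, 20, 50):
    print(n, variance(n), 1.0 / (n + 2), 1.0 / (n - 1))
```

The two variances differ noticeably for small N and approach each other as
N grows, which is consistent with the approximation being an asymptotic one.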


Letting

$$\delta_j(t) = \begin{cases} 0 , & t = 1 , \\ x_j(t) - S_j(t-1) , & t > 1 , \end{cases} \tag{26}$$

the t-th cell center is located at

$$S_j(t) = S_j(t-1) + \frac{1}{t} \, \delta_j(t) . \tag{27}$$


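Thus, substituting Equation 27 into Equation 8, the cell sample
variance becomes (omitting the $\nu$ index but including the t index)

$$\hat{\sigma}_j^2(t) = \frac{1}{t} \sum_{r=0}^{t-2} \frac{t-r-1}{t-r} \, \delta_j^2(t-r) . \tag{28}$$

Let $t = t'$ be the index of the first sample point for which
$\hat{\sigma}_j(t) > \sigma_j(0)$, i.e., the first time cell growth occurs. Then for
$t \le t'$, by Equations 5 and 22, the expected value of $\hat{\sigma}_j^2(t)$ is

$$E\left[ \hat{\sigma}_j^2(t) \right] = \frac{\sigma_j^2(0) \, \tau_N^2}{N+2} \cdot \frac{1}{t} \sum_{r=0}^{t-2} \frac{t-r-1}{t-r} , \quad t \le t' , \tag{29}$$

or

$$E\left[ \hat{\sigma}_j^2(t) \right] \to \frac{\sigma_j^2(0) \, \tau_N^2}{N+2} \quad \text{as } t \to \infty . \tag{30}$$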
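The weighted sum that appears in Equation 29 comes from rewriting the
sample variance in terms of the increments of Equation 26. The sketch below
checks that identity on arbitrary data and evaluates the weighting factor.
It assumes the sample variance is normalized by 1/t, which is the
normalization implied by the form of Equation 29; the function names are
hypothetical.

```python
import random


def variance_direct(xs):
    """Sample variance about the sample mean, normalized by 1/t (assumed)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def variance_from_increments(xs):
    """(1/t) * sum over r = 2..t of ((r-1)/r) * delta(r)^2, delta(r) = x(r) - S(r-1)."""
    mean, total = xs[0], 0.0
    for r in range(2, len(xs) + 1):
        delta = xs[r - 1] - mean          # Equation 26
        total += (r - 1) / r * delta * delta
        mean += delta / r                 # Equation 27
    return total / len(xs)


def growth_factor(t):
    """(1/t) * sum_{r=0}^{t-2} (t-r-1)/(t-r), the factor in Equation 29."""
    return sum((t - r - 1) / (t - r) for r in range(t - 1)) / t


random.seed(0)
data = [random.uniform(-1.0, 1.0) for _ in range(25)]
print(variance_direct(data), variance_from_increments(data))
print([round(growth_factor(t), 4) for t in (2, 5, 10, 100)])
```

The factor is below one for every finite t and increases toward one, which
is the property used in the argument that follows.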


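From Equations 10 and 29 it is seen that a necessary condition for
$\hat{\sigma}_j(t)$ to be greater than $\sigma_j(0)$, so that cell growth may be expected
to begin, is

$$E\left[ \hat{\sigma}_j^2(t) \right] > \sigma_j^2(0) ,$$

that is,

$$\frac{\sigma_j^2(0) \, \tau_N^2}{N+2} \cdot \frac{1}{t} \sum_{r=0}^{t-2} \frac{t-r-1}{t-r} > \sigma_j^2(0) ,$$

or

$$\tau_N^2 > N + 2 , \tag{31}$$

since

$$\frac{1}{t} \sum_{r=0}^{t-2} \frac{t-r-1}{t-r} < 1 , \quad \text{for } t < \infty .$$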


Furthermore, the choice of $\tau_N$ determines not only whether the
cell may be expected to grow, but also the number $t'$ of observations
that must fall in the cell before cell growth can be expected to
begin. It is desirable that $t'$ be chosen sufficiently large to
establish a firm cell location before the cell may be expected to
grow. On the other hand, since the amount of data available for
probability density function estimation is always limited in practice,
$t'$ must not be too large.

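Having chosen an appropriate value for $t'$, the choice of the
control parameter $\tau_N$ becomes automatic. Writing $\tau_N = \beta \sqrt{N+2}$,
and considering $\beta$ as an unknown, Equation 29 can be solved for $\beta$:

$$\beta = \left[ \frac{E\left[ \hat{\sigma}_j^2(t) \right]}{\sigma_j^2(0)} \right]^{\frac{1}{2}} \left[ \frac{1}{t} \sum_{r=0}^{t-2} \frac{t-r-1}{t-r} \right]^{-\frac{1}{2}} , \quad t \le t' . \tag{32}$$

However, since $t'$ is the index of the first sample point for which
$\hat{\sigma}_j(t') > \sigma_j(0)$, or $E[\hat{\sigma}_j^2(t')] > \sigma_j^2(0)$, it is seen that for

$$\beta' = \left[ \frac{1}{t'} \sum_{r=0}^{t'-2} \frac{t'-r-1}{t'-r} \right]^{-\frac{1}{2}} \tag{33}$$

the choice of

$$\tau_N = \beta' \sqrt{N+2} \tag{34}$$

will result in a beginning of cell growth after an average of $t'$
sample points fall in the cell.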
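The relation between the desired delay and the control parameter can be
tabulated directly. The sketch below assumes the form of Equation 33 given
above; the function names are hypothetical and the delay must be at least
two observations for the sum to be non-empty.

```python
import math


def growth_factor(t):
    """(1/t) * sum_{r=0}^{t-2} (t-r-1)/(t-r), as in Equation 29."""
    return sum((t - r - 1) / (t - r) for r in range(t - 1)) / t


def beta_prime(t_prime):
    """Equation 33 as read here; requires t_prime >= 2."""
    return 1.0 / math.sqrt(growth_factor(t_prime))


def tau_for(t_prime, n_dims):
    """Equation 34: tau_N = beta' * sqrt(N + 2)."""
    return beta_prime(t_prime) * math.sqrt(n_dims + 2)


# beta' falls toward one as the desired delay t' grows.
for t_prime in (2, 3, 5, 10, 20):
    print(t_prime, beta_prime(t_prime), tau_for(t_prime, n_dims=4))
```

Because the factor under the square root is always below one, every value
produced this way satisfies the necessary condition of Equation 31.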

A curve of $\beta'$ as a function of $t'$ is shown in Figure 8. For


the  particular  choice  of 


,  the  curve  in  Figure  8 


indicates that the average value of $t'$ will be approximately 4.7
before cell growth begins.

Since the probability density functions of greatest interest will
more than likely be non-uniform over the entire space, there will,
in general, be a wide spread in the range of cell probabilities.
Therefore, the cells with high probabilities will normally begin to
grow before the majority of the other cells have collected $t'$
observations. Since the growth of an individual cell is limited by
the presence of surrounding cells, it is reasonable to expect that in
many instances the cells located near the modes of the distribution
will have grown to their maximum limit by the time an average of $t'$
points have been processed for each of the cells in the entire cell
structure. This phenomenon requires further investigation.

An investigation of the dynamics of the growth mechanism should
be carried out to shed more light on the method of selecting control
parameters as discussed here. Experimentation should be of value in
indicating if modifications to the above theory are necessary.


SECTION VI


SUMMARY 

A  very  general  classification  model  has  been  developed  using  the 
concepts  of  non-parametric  pattern  recognition  based  on  limited  data 
of  known  classification.  The  fundamental  difference  between  decisions 
derivable  from  standard  statistical  techniques  and  decisions  based  on 
this  model  is  that  decision  theory  assumes  knowledge  of  the  relative 
frequency  of  occurrence  of  every  observable  set  of  discriminants  from 
all  classes  of  interest,  while  here,  this  knowledge  is  missing  and 
estimates  of  the  required  statistical  quantities  are  automatically 
made  from  a  finite  number  of  known  class  samples. 

The model has two distinct modes of operation, a learning mode
and a recognition mode. In the learning mode, partitioning of an
N-dimensional parameter space (using discriminants derived from the
seismic signal as coordinates) is accomplished by estimating the
joint probability densities of the parameters for each of the input
classes in question. In the recognition mode, maximum-likelihood
ratio decisions on the estimated joint densities are made. It is
significant that, in the learning mode, the estimates are formulated
with cells which adjust their size automatically according to the data
so that a good approximation to the class density function is obtained
with a minimum number of cells. It is also significant that the cells
are mode-seeking in that, as new data are introduced, they move in the
direction of the greatest concentration of data points.

The entire development has been of necessity introductory,
intended to give insight into the broad concepts involved. Thus, many
of the problems of implementing the model and integrating it as a
working part of a seismic signal processor require further theoretical
studies as well as experimental verifications.


SECTION VII


RECOMMENDATIONS  FOR  FURTHER  STUDY 


Only a few of the many study problems which come to mind as a
result of this study are listed below. In general, one is interested
in developing clustering transformations, determining the quality of
decisions rendered by the model, and gaining knowledge about the
habits of the model's performance. In particular:

1.  The cell growth mechanism should be studied theoretically
in much more detail. For example, the effects of selecting
$\sigma_\nu(0)$ and $\tau_N$ should be explored as a function of the size
of the data sample.

2.  An experimental study of the control parameter selection
problem should be undertaken.

3.  The  accuracy  of  probability  density  estimation  should  be 
experimentally  determined  using  real  and  synthesized  data. 

4.  The  quality  of  the  estimation  procedure,  both  for  the  purpose 
of  determining  the  reliability  of  the  decision  rendered  in 
any  one  instance  and  for  the  purpose  of  modifying  the 
learning  procedure  to  yield  decisions  with  lower  error 
probabilities,  should  be  theoretically  investigated. 

5.  The proposed model should be compared to other techniques
on the basis of error probabilities, complexity of
implementation, etc.

6.  Transformations on the original space should be considered
for increasing apparent class separability. One such
transformation might be concerned with minimizing entropy. Thus,
for the density $p_i(y)$, one might minimize

$$H_i(y) = - \int p_i(y) \log p_i(y) \, dy ,$$

since $H_i(y)$ is a function only of the manner in which
class members are distributed in observation space.

7.  The order dependency of the probability density estimation
technique should be investigated.


APPENDIX


CLASSIFICATION  DECISIONS  BASED  ON  INCOMPLETE  SETS  OF 

OBSERVATIONS 

The problem of classifying signals is generally treated as a
problem of optimally deciding, on the basis of N observed
measurements on that signal $(x_1, x_2, \ldots, x_N)$, which of several classes
produced the particular set of N measurements. If the joint
probability densities of the N measurements are known under all
assumptions of class membership for the set of N measurements,
classification decisions can be rendered by computing the N-dimensional
likelihood ratios and then comparing these ratios with each other or
with a threshold. With the joint probability densities not known,
but a finite number of measurements of each class available, decision
rules can be devised which approach the likelihood ratio computation
in the limit, as the number of measurements approaches infinity.

The problem which will be considered in this appendix concerns
the method of making the optimum decision when not all of the
N measurable parameters of an N-dimensional process are available.
It will be assumed for simplicity that the incomplete set of
measurements may be a member of only one of two classes, class E (earthquake)
or class N (nuclear explosion). It will also be assumed that the
cost of deciding that the set belongs to E when indeed it is a member
of N (the cost of false dismissal) is $c_2$, and that the cost of
deciding in favor of N when actually the set belongs to E is $c_1$
(the cost of false alarm).

If all the N measurements on an input were available, and $c_1$
were equal to $c_2$, then a reasonable way of making classification
decisions would be to decide using the following equation.

$$\mathbf{x} \in E \quad \text{if} \quad \frac{p(E \mid x_1, x_2, \ldots, x_N)}{p(N \mid x_1, x_2, \ldots, x_N)} > 1 . \tag{35}$$

This decision states that if, given $x_1, x_2, \ldots, x_N$, E is more
likely than N, then one should decide that the N-dimensional
vector $\mathbf{x}$ belongs to E. The opposite decision is made if the
inequality does not hold. By applying Bayes' rule, Equation 35
may be written as shown in Equations 36 and 37, where P(E) and P(N)
are the a priori probabilities of occurrence of class E and N,
respectively.

$$\frac{p(E \mid x_1, \ldots, x_N)}{p(N \mid x_1, \ldots, x_N)} = \frac{P(E) \, p_E(x_1, \ldots, x_N)}{P(N) \, p_N(x_1, \ldots, x_N)} > 1 \tag{36}$$

$$\mathbf{x} \in E \quad \text{if} \quad \ell(\mathbf{x}) = \frac{p_E(x_1, \ldots, x_N)}{p_N(x_1, \ldots, x_N)} > \frac{P(N)}{P(E)} = T = \text{const} \tag{37}$$

The function $\ell(\mathbf{x})$ is the N-dimensional likelihood ratio.

Several  other  likelihood  ratios  will  be  introduced  later  and  will  be 
distinguished  from  each  other  by  subscripts  which  will  indicate  the 
decision  rule  with  which  the  likelihood  ratios  are  associated. 

Suppose now that of the N measurable parameters on which
decisions should be based, only N-H are available. In the following,
four of several possible decision rules are discussed for deciding,
from the available measurements, which of the two classes, E or N, is
most likely.

Decision Rule One (Decisions Based on Marginal Densities)

This rule states that, given the measurements $(x_1, x_2, \ldots, x_{N-H})$,
decide that these measurements belong to E if E is more likely
than N, as follows:

$$(x_1, x_2, \ldots, x_{N-H}) \in E \quad \text{if} \quad \frac{p(E \mid x_1, x_2, \ldots, x_{N-H})}{p(N \mid x_1, x_2, \ldots, x_{N-H})} > 1 . \tag{38}$$

By employing Bayes' rule, this equation can be rewritten as

$$\frac{p(E \mid x_1, x_2, \ldots, x_{N-H})}{p(N \mid x_1, x_2, \ldots, x_{N-H})} = \frac{P(E) \, p_E(x_1, x_2, \ldots, x_{N-H})}{P(N) \, p_N(x_1, x_2, \ldots, x_{N-H})} > 1 , \tag{39}$$

or

$$(x_1, \ldots, x_{N-H}) \in E \quad \text{if} \quad \frac{p_E(x_1, x_2, \ldots, x_{N-H})}{p_N(x_1, x_2, \ldots, x_{N-H})} > \frac{P(N)}{P(E)} = T . \tag{40}$$

If we let

$$\ell_1(x_1, x_2, \ldots, x_{N-H}) = \frac{p_E(x_1, x_2, \ldots, x_{N-H})}{p_N(x_1, x_2, \ldots, x_{N-H})}$$

be the N-H dimensional likelihood ratio, abbreviated as $\ell_1(\mathbf{x})$ to
simplify the notation, then decision rule 1 states that we should
compare $\ell_1(\mathbf{x})$ with a threshold to determine if $\mathbf{x}$ should be
classified in E or N. This, in effect, means that if $x_1, \ldots, x_{N-H}$
are the only measurements made, these measurements alone should be the
basis for decision.
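Decision rule 1 can be illustrated with a small numerical sketch. Everything
specific in it is an assumption made for illustration: the two classes are
given independent gaussian measurements with made-up means and standard
deviations, so that the marginal density over the observed coordinates is
simply the product of their one-dimensional densities.

```python
import math


def normal_pdf(x, mean, std):
    """One-dimensional gaussian density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def marginal_likelihood_ratio(observed, params_e, params_n):
    """l_1(x): ratio of marginal densities over the observed coordinates only.

    observed maps coordinate index -> measured value; params_* map coordinate
    index -> (mean, std). Coordinates are assumed independent (hypothetical).
    """
    ratio = 1.0
    for i, xi in observed.items():
        ratio *= normal_pdf(xi, *params_e[i]) / normal_pdf(xi, *params_n[i])
    return ratio


def decide(observed, params_e, params_n, prior_e=0.5, prior_n=0.5):
    """Equation 40: choose E when l_1(x) exceeds T = P(N) / P(E)."""
    threshold = prior_n / prior_e
    ratio = marginal_likelihood_ratio(observed, params_e, params_n)
    return "E" if ratio > threshold else "N"


# Two-dimensional classes, but only coordinate 0 has been measured.
params_e = {0: (0.0, 1.0), 1: (0.0, 1.0)}
params_n = {0: (2.0, 1.0), 1: (2.0, 1.0)}
print(decide({0: 0.5}, params_e, params_n))
print(decide({0: 1.5}, params_e, params_n))
```

The missing coordinate never enters the computation, which is the point of
the rule: only the marginal densities of what was measured are compared.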

Decision Rule Two (Decisions Using Most Probable Values of Missing
Measurements $x_{N-H+1}$ through $x_N$)

After the N-H measurements $(x_1, x_2, \ldots, x_{N-H})$ are made, the
probability density of the missing H measurements ($x_{N-H+1}$ through
$x_N$) can be calculated, and the most probable values of the H
measurements, chosen for use in the N-dimensional likelihood ratio,
can be determined. The value of the ratio $\ell(\mathbf{x})$ when the most
probable values of the H missing measurements are used is $\ell_2(\mathbf{x})$.
The most probable values, written here as $\hat{x}_{N-H+1}, \ldots, \hat{x}_N$, are
those which maximize the probability density given in Equation 41.

$$p(x_{N-H+1}, x_{N-H+2}, \ldots, x_N \mid x_1, x_2, \ldots, x_{N-H}) \tag{41}$$

Thus,

$$p(\hat{x}_{N-H+1}, \ldots, \hat{x}_N \mid x_1, \ldots, x_{N-H}) \ge p(x_{N-H+1}, \ldots, x_N \mid x_1, \ldots, x_{N-H}) . \tag{42}$$

Accordingly, the decision rule states that one should decide

$$(x_1, \ldots, x_{N-H}) \in E \quad \text{if} \quad \ell_2(\mathbf{x}) = \frac{p_E(x_1, \ldots, x_{N-H}, \hat{x}_{N-H+1}, \ldots, \hat{x}_N)}{p_N(x_1, \ldots, x_{N-H}, \hat{x}_{N-H+1}, \ldots, \hat{x}_N)} > \frac{P(N)}{P(E)} = T . \tag{43}$$

This rule predicts the most likely values of the missing measurements
and uses them as if they had actually been measured.

Decision Rule Three (Decisions Using the Most Probable Value of the
Likelihood Ratio)

When only N-H measurements $(x_1, x_2, \ldots, x_{N-H})$ are made, the
likelihood ratio $\ell(\mathbf{x})$ is a function of the unmeasured random variables
$x_{N-H+1}$ through $x_N$. This is denoted by $\ell_3(\mathbf{x})$ and is defined in
Equation 44,

$$\ell_3(\mathbf{x}) = \frac{p_E(x_1, \ldots, x_{N-H}, x_{N-H+1}, \ldots, x_N)}{p_N(x_1, \ldots, x_{N-H}, x_{N-H+1}, \ldots, x_N)} . \tag{44}$$

Accordingly, there is a probability density $p[\ell_3(\mathbf{x})]$ associated with
$\ell_3(\mathbf{x})$, so that the most probable value of the likelihood ratio,
$\hat{\ell}_3(\mathbf{x})$, given the observed $x_1, \ldots, x_{N-H}$ measurements, can be
determined. Thus,

$$p[\hat{\ell}_3(\mathbf{x})] \ge p[\ell_3(\mathbf{x})] . \tag{45}$$

The decision rule is therefore: decide

$$(x_1, \ldots, x_{N-H}) \in E \quad \text{if} \quad \hat{\ell}_3(\mathbf{x}) > T . \tag{46}$$

Decision Rule Four (Decisions Based on the Average Value of the
Likelihood Ratio)

In this decision rule, the likelihood ratio is again treated as
a function of the missing H measurements. However, instead of using
the most probable value of $\ell_3(\mathbf{x})$ as in rule number 3, the average
value of this likelihood ratio is used as the basis for deciding
between E and N. Thus,

$$(x_1, \ldots, x_{N-H}) \in E \quad \text{if} \quad \bar{\ell}_3(\mathbf{x}) > T , \tag{47}$$

where

$$\bar{\ell}_3(\mathbf{x}) = \int_0^{\infty} \ell_3(\mathbf{x}) \, p[\ell_3(\mathbf{x})] \, d\ell_3 . \tag{48}$$

Further, if $\ell_3(\mathbf{x})$ is a monotonic function of $x_{N-H+1}$ through $x_N$,
Equation 48 may be written as

$$\bar{\ell}_3(\mathbf{x}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \ell_3(x_{N-H+1}, \ldots, x_N) \, p(x_{N-H+1}, \ldots, x_N \mid x_1, \ldots, x_{N-H}) \, dx_{N-H+1} \cdots dx_N . \tag{49}$$
To compare the different decision rules, the probabilities of error
are computed and that rule which yields the smallest error probability
is sought. The two error rates, the probability of false alarm and
the probability of false dismissal, P(FA) and P(FD), are given in
Equations 50 and 51, where Y is the region in N-dimensional space
in which the decision rule in question decides that the set of
measurements $(x_1, \ldots, x_N)$ belongs to N, and Y* is the region in
which the decision favors class E. If there are only two classes,
Y* is the complement of Y.

$$P(FA) = \int \cdots \int_{Y} p_E(x_1, x_2, \ldots, x_N) \, dx_1 \, dx_2 \cdots dx_N \tag{50}$$

$$P(FD) = \int \cdots \int_{Y^*} p_N(x_1, x_2, \ldots, x_N) \, dx_1 \, dx_2 \cdots dx_N \tag{51}$$


However, given the values $(x_1, x_2, \ldots, x_{N-H})$, the likelihood
ratios used by the four decision rules, $\ell_1(\mathbf{x})$ through $\bar{\ell}_3(\mathbf{x})$, are all
functions of the N-H given measurements alone. Thus, no matter how
complicated the likelihood ratios may be, they are, for a specified
choice of $p_E(x_1, \ldots, x_N)$ and $p_N(x_1, \ldots, x_N)$, deterministic
functions of $x_1, x_2, \ldots, x_{N-H}$. Thus, the integrals of Equations 50
and 51 may be written as shown in Equations 52 and 53, where y denotes
the region of N-H dimensional space in which the measurement values are
assigned to class N by the rule in question. Similarly, y* is the
complement of y.

$$P(FA) = \int \cdots \int_{y} p_E(x_1, x_2, \ldots, x_{N-H}) \, dx_1 \, dx_2 \cdots dx_{N-H} \tag{52}$$

$$P(FD) = \int \cdots \int_{y^*} p_N(x_1, x_2, \ldots, x_{N-H}) \, dx_1 \, dx_2 \cdots dx_{N-H} \tag{53}$$


If the positive constants $c_1$ and $c_2$ are the costs of false alarm
and false dismissal, $p_E(x_1, \ldots, x_{N-H})$ and $p_N(x_1, \ldots, x_{N-H})$ are
the marginal densities of the random processes E and N, and $y_s$
and $y_g$ are the regions in the N-H dimensional subspace of measured
values in which decision rules s and g, respectively, decide that
the observations should belong to N, then rule s is better than
rule g if the inequality of Equation 54 holds in the specified
direction. Each side of the inequality expresses the cost-weighted
probability of error according to the corresponding decision rule.

$$P(E) \, c_1 \int \cdots \int_{y_s} p_E(x_1, \ldots, x_{N-H}) \, dx_1 \cdots dx_{N-H} + P(N) \, c_2 \int \cdots \int_{y_s^*} p_N(x_1, \ldots, x_{N-H}) \, dx_1 \cdots dx_{N-H}$$

$$< P(E) \, c_1 \int \cdots \int_{y_g} p_E(x_1, \ldots, x_{N-H}) \, dx_1 \cdots dx_{N-H} + P(N) \, c_2 \int \cdots \int_{y_g^*} p_N(x_1, \ldots, x_{N-H}) \, dx_1 \cdots dx_{N-H} \tag{54}$$

Furthermore, given the densities $p_N(x_1, \ldots, x_{N-H})$ and
$p_E(x_1, \ldots, x_{N-H})$, it is seen that the decision rule which minimizes
Q is best.

$$Q = P(E) \, c_1 \int \cdots \int_{y} p_E(x_1, \ldots, x_{N-H}) \, dx_1 \cdots dx_{N-H} + P(N) \, c_2 \int \cdots \int_{y^*} p_N(x_1, \ldots, x_{N-H}) \, dx_1 \cdots dx_{N-H} \tag{55}$$

However, since there are only two classes, Equation 56 allows
simplification of Equation 55 to Equation 57.

$$\int \cdots \int_{y^*} p_N(x_1, \ldots, x_{N-H}) \, dx_1 \cdots dx_{N-H} = 1 - \int \cdots \int_{y} p_N(x_1, \ldots, x_{N-H}) \, dx_1 \cdots dx_{N-H} \tag{56}$$

$$Q = c_2 \, P(N) + \int \cdots \int_{y} \left[ c_1 \, P(E) \, p_E(x_1, \ldots, x_{N-H}) - c_2 \, P(N) \, p_N(x_1, \ldots, x_{N-H}) \right] dx_1 \cdots dx_{N-H} \tag{57}$$


It is seen that Q is smallest if y is the region in which the
integrand is always negative. In this region,

$$c_2 \, P(N) \, p_N(x_1, \ldots, x_{N-H}) > c_1 \, P(E) \, p_E(x_1, \ldots, x_{N-H}) . \tag{58}$$
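The minimizing property of the region defined by Equation 58 can be checked
numerically in one dimension. The sketch below is an illustration under
assumed conditions: a single observed measurement, hypothetical gaussian
marginals for the two classes, and decision regions restricted to the form
"decide N when x exceeds a threshold".

```python
import math


def normal_cdf(x, mean, std):
    """Gaussian cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def expected_cost(threshold, prior_e=0.5, prior_n=0.5, c1=1.0, c2=1.0):
    """Q of Equation 55 for the rule 'decide N when x > threshold'.

    Hypothetical classes: E ~ Normal(0, 1) and N ~ Normal(2, 1).
    """
    false_alarm = 1.0 - normal_cdf(threshold, 0.0, 1.0)   # E sample falls in y
    false_dismissal = normal_cdf(threshold, 2.0, 1.0)     # N sample falls in y*
    return prior_e * c1 * false_alarm + prior_n * c2 * false_dismissal


# Search thresholds on a grid and report the one with the smallest Q.
grid = [k / 100.0 for k in range(-200, 401)]
best = min(grid, key=expected_cost)
print(best, expected_cost(best))
```

With equal costs and equal priors the best threshold sits where the two
densities are equal, and raising the cost of false dismissal moves it
toward the E class, as the inequality of Equation 58 predicts.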


Further, for the case where $c_1 = c_2$, this reduces to the decision
rule: decide N if

$$\frac{p_N(x_1, x_2, \ldots, x_{N-H})}{p_E(x_1, x_2, \ldots, x_{N-H})} > \frac{P(E)}{P(N)} = \frac{1}{T} , \tag{59}$$

with T as defined in Equation 40. We recognize that this is just the
decision made on the marginal densities by decision rule 1.

Thus, it is seen that the optimum classification decision based
on N-H observed measurements of the set of N measurements consists
of comparing the ratio of a posteriori probabilities of these observed
measurements with a threshold, and that no useful purpose is served by
knowing to what higher dimensional process the N-H measurements
belong.